Method and system for determining a measure of overlap between data entries

ABSTRACT

A data analysis system and method for determining a measure of overlap between data entries in a number N of columns in a database. Sorted hash lists are provided. An N×N matrix having cells C ij  is provided. A set of N indexed read pointers pointing to the hash lists are provided. Each read pointer points to the first entry of the associated hash list. The value of cells C ij  in the matrix having indices i,j, wherein i and j each correspond to any of the index numbers of the read pointers pointing to the lowest value are incremented. The read pointer(s) pointing to the lowest value are incremented to point to the next different hash value(s). This is repeated until the last read pointer(s) points to the last entry of the associated hash list.

This application is a continuation application of original applicationSer. No. 14/215,292 filed on Mar. 17, 2014, and allowed on Jul. 14,2017.

FIELD OF THE INVENTION Field of the Invention

The present invention concerns a data analysis system.

More specifically, embodiments of the present invention concern methodsand apparatus for processing data to determine a measure of overlapbetween data entries in a number of columns in one or more databases.

BACKGROUND TO THE INVENTION

These days much data is generated and stored in digital form. Since the1980s the world's capacity to digitally store information has increasedby over twenty percent per year. In 2012 every day 2.5 exabytes(2.5×10¹⁸) of data were created every day. Some parts of this data ispublicly available, other parts are in-company data.

The term ‘big data’ is often used in this connection for a collection ofdata so large and complex that it becomes difficult to process usingon-hand database management tools or traditional data processingapplications.

This data is often heterogeneous with many interconnections anddependencies (relations) or and/or correlations. Large collections ofdata relations contain valuable information, but these relations need tobe ordered and structured before the actual patterns present in the datacan be easily disclosed. It is desirable to leverage the valuable andoften unknown information contained in this data. For example thisallows data analysis where none currently takes place. Nevertheless,this requires assessing millions of data points within an acceptableperiod of time.

Much of this data is stored in large databases, sometimes referred to asdata warehouses. Such databases can store thousands of columns of dataentries. The total number of data entries in such database can bemillions or even billions.

The database can for instance store in-company data, such as clientdata. Such client data can be distributed over columns storing all kindsof information. Some groups of columns can relate to personal data suchas first names, last names, social security numbers, phone numbers,email addresses, IF-addresses, street addresses, postal codes, citynames, state names, country names, etc. Other groups of columns canrelate to financial information such as bank account numbers, creditcard numbers, etc. Yet other groups of columns can relate to productsoffered by a company, such as financial products such as mortgage types,savings account types, loan types, credit types, clients making use ofsuch products, etc. Yet other groups of columns can relate to insuranceproducts such as car insurance types, health insurance types, lifeinsurance types, home insurance types, liability insurance types,clients making use of such products, etc. Also, the database can containadditional columns. For instance in relation to car insurance types thedatabase can also include columns relating to car makes, car types,gasoline consumption, CO₂ emissions, car weight, etc.

The vastness of the amount of data stored in the database can makeassessing interrelations between separate columns of data virtuallyimpossible or at least very complex and time consuming. This may causethat a first department within a company, e.g. a financial department,is unaware of data stored by a second department, e.g. an insurancedepartment. From a business perspective it would be highly desirablethat separate departments can benefit from data stored by otherdepartments.

From a marketing perspective it can also be desirable to be able tocombine and/or compare databases of different companies, e.g. of a bankor insurance company and a telecom provider.

In view of the above a data analysis system is desirable which assistsin assessing relations between columns of data in a database.

SUMMARY OF THE INVENTION

In accordance with a first aspect of the present invention there isprovided a computer implemented method of determining a measure ofoverlap between data entries in a number of columns in one or moredatabases. The method includes the following steps:

-   -   a) For each one of N columns a hash list is created comprising        for each data entry in the column a hash value representative of        said data entry, the values in the hash list being sorted        according to the hash values in the list.    -   b) A matrix is created having N columns and N rows (N×N matrix).        The matrix has cells identified by C_(ij), wherein i represents        the column number and j represents the row number of the cell in        the matrix.    -   c) A set of N indexed read pointers is assigned. Each read        pointer points to a single associated sorted hash list.    -   d) Each read pointer is set to the first entry of the associated        hash list.    -   e) The index number(s) of the read pointer(s) pointing to the        lowest hash value when processing the lists in ascending order,        or the index number(s) of the read pointer(s) pointing to the        highest hash value when processing the lists in descending order        are determined.    -   f) The value of cells C_(ij) in the matrix having indices i,j,        wherein i and j each correspond to any of the index numbers        determined in step e) are incremented.    -   g) The read pointer(s) having the index number(s) determined in        step e) are incremented to point to the next different hash        value(s) and steps e) and f) are repeated.    -   h) Step g) is repeated until the last read pointer(s) points to        the last entry of the associated hash list.

Thus, a matrix is generated which provides a measure of overlap betweenindividual columns in the one or more databases. It will be appreciatedthat the cell values on the diagonal, i.e. C_(ij) with i equal to j,give a measure of the number of unique entries in the column i of thedatabases. It will be appreciated that the cell values off-diagonal,i.e. C_(ij) with i unequal to j, give a measure of the number of uniquevalues that are shared between columns i and j of the databases.

Providing the matrix provides the advantage that a measure of overlapbetween different columns in the database(s) is expressed as a singlevalue, without having to assess the actual individual data entries inthe columns. Therefore, columns sharing data entries can be identifiedand ranked in an easy manner. Moreover, providing the matrix providesthe advantage that a measure of overlap between different columns in thedatabase(s) can be determined on simple numerical basis from the cellvalues C_(ij), without having to take into account values and/or typesof the underlying raw data.

Moreover, the matrix is generated in a highly efficient manner byprocessing and comparing all columns in parallel. This greatly reducesthe time in which the matrix is generated, which is of importance whenassessing large databases, for example a database containing on theorder of 20,000,000,000 separate data entries in 20,000 columns.

Once the matrix has been generated, it is possible to evaluate thecontents of columns of the database(s) by looking at the cell values ofthe matrix. Therefore, overlap of data in columns and other similaritiesor correlations can easily be obtained by looking at a number of cellvalues that is far less than the millions or billions (or more) of dataentries. Hence, the evaluation can be performed much faster and withmuch less load on the system.

Also, the matrix allows evaluating overlap of data in columns and othersimilarities or correlations without actual access to the underlyingdata entries. Therefore, it is also possible to outsource evaluation tothird parties while maintaining the underlying data entries secure orsecret.

Optionally, the matrix contains integer values. Optionally, the matrixis empty at the start of the process, i.e. each cell then contains thevalue zero. In step f) the values of those cells are optionallyincreased by one. Hence, the cell values on the diagonal, i.e. C_(ij)with i equal to j, give the number of unique entries in the column i ofthe databases, and the cell values off-diagonal, i.e. C_(ij) with iunequal to j, give the number of unique values that are shared betweencolumns i and j of the databases. It will be appreciated that whenprocessing all lists in ascending order the first entry is the lowesthash value, and when processing all lists in descending order the firstentry is the highest hash value.

Providing for each one of N columns a hash list comprising for each dataentry in the column a hash value representative of said data entryprovides the advantage that columns of dissimilar content or format canstill easily be compared. The hash value may be obtained by any knownhash function. Examples are SHA (Secure Hash Algorithm) functions, MD5(Message Digest Algorithm 5) functions, removing of spaces, conversionto lowercase, etc.

The step a) may include for each hash list discarding identical values,which simplifies the step of incrementing the read pointers to point tothe next different hash value(s).

The step f) may include incrementing the value of the cells C_(ij) inthe matrix having indices i,j, wherein i,j corresponds to each uniquecombination of the index numbers determined in step e), including thecases i=j. In this way, only half of the matrix (relative to thediagonal) is updated, which makes use of the notion that the matrix ofvalues is symmetrical with respect to the diagonal. This can savevaluable computing time when processing large amounts of data.

The step f) may include incrementing the value of all cells C_(ij) inthe matrix having indices i,j, wherein i and j each corresponding to anyof the index numbers determined in step e). In this way all of the cellsof the matrix are updated.

All hash values within a hash list may have the same length. All hashvalues in all hash lists may have the same length. This renderscomparing of the hash values in different columns simpler and moreefficient.

The method may including the step of determining a desired level ofaccuracy and creating the hash values having a length that has beendetermined on the basis of the desired level of accuracy. Creating foreach data entry in the columns a hash value representative of said dataentry may involve loss of accuracy, especially when the hash values areshorter than the data entries. Normally, the longer the hash values are,the more accurate the hash value represents the original data entry inthe column in the database. It is also possible to deliberately set thedesired level of accuracy relatively low, so as to allow similar, butnon-identical, data entries to be represented by identical hash valuesso as to determine a measure of overlap of similar data entries betweendifferent columns. Herein similar indicates that the data entries have apartial overlap but are not fully identical. This can e.g. be used toaccount for spelling errors and/or typographical errors.

The method may further include after step h) determining the cellC_(ij), with i unequal to j, having the highest value. This cell C_(ij)having the highest value signifies the combination of columns i and jhaving the largest number of data entries in common. The method mayfurther include after step h) determining the cells C_(ij) having avalue that is higher than a threshold value. This yields the columnshaving more data entries in common than specified by the thresholdvalue.

The method may further include after step h) normalizing the values inthe cells of the matrix by dividing the value of each cell C_(ij) by thevalue of C. The cells C_(ij) with i>j signify the percentage of overlapof values in column i found in column j, the cells C_(ij) with i<jsignify the percentage of overlap of values in column j found in columni. It will be appreciated that the matrix containing the normalizedvalues in the cells is not necessarily symmetrical relative to thediagonal. The method may further include determining the cells C_(ij)having a normalized value that is higher than a threshold value.

The method may further include after step h) processing the values ofthe cells in row p and/or column p by dividing the value of cells C_(xy)by the value of cells C_(yx), The value of the thus processed value ofC_(ij) signifies the ratio of the amount of values present in column irelative to the amount of values present in column j. The cell C_(ij) orC_(ji) having the largest normalized value and the processed valueclosest to one indicates the column j being the closest subset orsuperset of column i.

The present method is especially well suited for handling big data. Thenumber N of columns may be more than 1,000, more than 10,000, and evenmore than 100,000.

The method my further include the step of including in the number N ofcolumns at least one column containing data entries of a known type. Aswill be discussed below, this allows to determine a type of data entriesin a column of unknown data type.

The step g) may include when the read pointer was not already pointingto the last entry in the hash list: incrementing the read pointer(s)pointing to the hash value determined in step e) to point to the nextentry in the respective hash list, and when the read pointer was alreadypointing to the last entry in the hash list: not incrementing the readpointer(s) pointing to the hash value determined in step e), orincrementing the read pointer(s) pointing to the hash value determinedin step e) to point to outside the respective hash list, and ignoringthe index number of that read pointer(s) for subsequent steps f) and g).

The method may further include after step h):

-   -   i) receiving a further number M of columns and creating M        further sorted hash lists comprising for each data entry in the        further column a hash value representative of said data entry;    -   j) adding N+1^(th) to N+M^(th) columns and N+1^(th) to N+M^(th)        rows to the matrix;    -   k) assigning a set of N+M indexed read pointers, each read        pointer pointing to a single associated sorted hash list, the        N+1^(th) to N+M^(th) read pointers pointing to the further hash        lists;    -   l) setting each read pointer to the first entry of the        associated hash list;    -   m) determining the index number(s) of the read pointer(s)        pointing to the lowest hash value when processing the lists in        ascending order, or determining the index number(s) of the read        pointer(s) pointing to the highest hash value when processing        the lists in descending order;    -   n) incrementing the value of cells C_(ij) in the matrix having        indices i,j, wherein i and j each correspond to any of the index        numbers determined in step m) and wherein at least one of i and        j is in the range of N+1 to N+M;    -   o) incrementing the read pointer(s) having the index number(s)        determined in step m); and    -   p) repeating step o) until the last read pointer(s) points to        the last entry of the associated hash list. Hence, one or        more (M) columns can be added to previously assessed N columns,        without determining the entire matrix anew. This can save        valuable computing time when processing large amounts of data.        Moreover, this allows gradual buildup of the matrix for very        large data sets. This also allows periodical update of the        matrix for expanding data sets.

A type of data entries in at least one of the columns may be one offirst name, last name, social security number, phone number, emailaddress, IP-address, street address, postal code, city, country, bankaccount number, credit card number.

Although, as described above, providing for each data entry in thecolumns a hash value representative of said data entry may beadvantageous, this is not always essential. Thus, more in generalaccording to a second aspect of the invention there is provided a methodof determining a measure of overlap between data entries in a number Nof columns in one or more databases, including:

-   -   a) retrieving for each column a list comprising for each data        entry in the column a value representative of said data entry,        the values in the list being sorted;    -   b) creating an N×N matrix, having cells C_(ij), wherein i        represents the column number and j represents the row number of        the cell in the matrix;    -   c) assigning a set of N indexed read pointers, each read pointer        pointing to a single associated sorted list;    -   d) setting each read pointer to the first entry of the        associated list;    -   e) determining the index number(s) of the read pointer(s)        pointing to the lowest value when processing the lists in        ascending order, or determining the index number(s) of the read        pointer(s) pointing to the highest value when processing the        lists in descending order;    -   f) incrementing the value of cells C_(ij) in the matrix having        indices i,j, wherein i and j each correspond to any of the index        numbers determined in step e);    -   g) incrementing the read pointer(s) having the index number(s)        determined in step (e) to point to the next different value(s)        and repeating steps e) and f); and    -   h) repeating step g) until the last read pointer(s) points to        the last entry of the associated list.

As set out above, herein the step a) of retrieving the lists may includecreating for each column a hash list comprising for each data entry inthe column a hash value representative of said data entry.

According to a third aspect of the invention there is provided a methodof determining a type of data entries in a to-be-assessed column in adatabase. According to this aspect, again a number of sorted lists isprovided. This may be sorted hash lists. One of the lists corresponds tothe to-be assessed column. One or more other lists correspond to sorted(hash) lists containing (hash) values representative of data entries ofknown types thus forming a number N of sorted (hash) lists. For theselists the N×N matrix is generated according to the steps a) through h)as described above with respect to the first and second aspects of thepresent invention. From this matrix it is determining which cell C_(pq)and/or C_(qp) indicates closest conformity between columns p and q,wherein the index p corresponds to the to-be-assessed column and index qcorresponds to a list associated with data entries of known types. Thetype of the data entries in the to-be-assessed column is determined tobe similar, or identical, to the known type of the data entries in thelist corresponding to the index q.

Determining which cell C_(pq) and/or C_(qp) indicates closest conformityincludes determining which cell C_(pq) and/or C_(qp) has the highestvalue. The highest value indicates the list q having the largest numberof data entries in common with column p.

The method may further include after step h) normalizing the values inthe cells in column p of the matrix by dividing the value of each cellC_(pj) by the value of C_(pp), wherein determining which cell C_(pq)indicates closest conformity includes determining which cell C_(pq) hasthe highest normalized value. The highest normalized value indicates thelist q having the largest percentage of data entries in common withcolumn p.

The method may further include after step h) normalizing the values inthe cells in row p of the matrix by dividing the value of each cellC_(ip) by the value of C_(pp), wherein determining which cell C_(qp)indicates closest conformity includes determining which cell C_(qp) hasthe highest normalized value. The highest normalized value indicates thelist q having the largest percentage of data entries in common withcolumn p.

The method may further include after step h) processing the values ofthe cells in row p and/or column p by dividing the value of cellsC_(xy), by the value of cells C_(yx), The value of the thus processedvalue of C_(ij) signifies the ratio of the amount of values present incolumn i relative to the amount of values present in column j. The cellC_(pq) or C_(qp) having the largest normalized value and the processedvalue closest to one indicates the column q being the closest subset orsuperset of column p.

According to a fourth aspect of the invention there is provided a dataanalysis system for determining a measure of overlap between columns ofdata entries, including:

-   -   at least one database storing a number N of columns of data        entries;    -   a first memory;    -   a second memory; and    -   a processing module arranged for    -   a) retrieving for each column a list comprising for each data        entry in the column a value representative of said data entry,        the values in the list being sorted and storing the lists in the        first memory;    -   b) creating an N×N matrix, having cells C_(ij), wherein i        represents the column number and j represents the row number of        the cell in the matrix, and storing the matrix in the second        memory;    -   c) assigning a set of N indexed read pointers, each read pointer        pointing to a single associated sorted list;    -   d) setting each read pointer to the first entry of the        associated list;    -   e) determining the index number(s) of the read pointer(s)        pointing to the lowest value when processing the lists in        ascending order, or determining the index number(s) of the read        pointer(s) pointing to the highest value when processing the        lists in descending order;    -   f) incrementing the value of cells C_(ij) in the matrix in the        second memory having indices i,j, wherein i and j each        correspond to any of the index numbers determined in step e);    -   g) incrementing the read pointer(s) having the index number(s)        determined in step (e) to point to the next different value(s)        and repeating steps e) and f); and    -   h) repeating step h) until the last read pointer(s) points to        the last entry of the associated list.

All optional measures described above in relation to the first, secondand third aspect of the invention apply equally well to this fourthaspect of the invention. For instance, the processing module may furtherbe arranged for creating for each column a hash list comprising for eachdata entry in the column a hash value representative of said data entry;and sorting the hash lists. The invention also relates to a dataanalysis system having a processing module arranged for performing thesteps of the methods of the first, second and third aspect of theinvention.

According to a fifth aspect of the invention there is provided anon-transient computer readable medium storing computer implementableinstructions which when implemented by a programmable computer cause thecomputer to determine a measure of overlap between data entries in anumber N of columns in one or more databases, by performing the stepsof:

-   -   a) retrieving for each column a list comprising for each data        entry in the column a value representative of said data entry,        the values in the list being sorted;    -   b) creating an N×N matrix, having cells C_(ij), wherein i        represents the column number and j represents the row number of        the cell in the matrix;    -   c) assigning a set of N indexed read pointers, each read pointer        pointing to a single associated sorted list;    -   d) setting each read pointer to the first entry of the        associated list;    -   e) determining the index number(s) of the read pointer(s)        pointing to the lowest value when processing the lists in        ascending order, or determining the index number(s) of the read        pointer(s) pointing to the highest value when processing the        lists in descending order;    -   f) incrementing the value of cells C_(ij) in the matrix having        indices i,j, wherein i and j each correspond to any of the index        numbers determined in step e);    -   g) incrementing the read pointer(s) having the index number(s)        determined in step (e) to point to the next different value(s)        and repeating steps e) and f); and    -   h) repeating step g) until the last read pointer(s) points to        the last entry of the associated list.

In a further aspect there is also provided a non-transient computerreadable medium storing computer implementable instructions which wheninterpreted by a programmable computer cause the computer to becomeconfigured as a data analysis system according to the fourth aspect.

All optional measures described above in relation to the first, second,third and fourth aspect of the invention apply equally well to each ofthe first through fifth aspects of the invention. The invention alsorelates to computer program product comprising code portions arrangedfor, when executed on a programmable computer, performing the steps ofthe methods of the first, second and third aspect of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention will now be described in detailwith reference to the accompanying drawings in which:

FIG. 1 is a schematic block diagram of a data analysis system inaccordance with an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a method in accordance with anembodiment of the invention;

FIG. 3 is a schematic flow chart of a method in accordance with anembodiment of the invention;

FIGS. 4a-4c are a schematic representation of steps in a method ofgenerating hash lists in accordance with an embodiment of the invention;and

FIGS. 5a -5l are a schematic representation of steps in a method ofgenerating a matrix in accordance with an embodiment of the invention.

DETAILED DESCRIPTION

FIG. 1 is a schematic block diagram of a data analysis system 1 inaccordance with an embodiment of the present invention. The dataanalysis system 1 comprises, or is associated with, a database 2. Thedata analysis system 1 may also comprise, or be associated with, aplurality of databases 2. The database(s) 2 includes a plurality ofcolumns 4.n (n=1, 2, 3, 4, . . . ) of data entries. A number of columnsof the database or databases will be processed by the data analysissystem 1. This number of columns is denoted by N. The data analysissystem 1 includes a processing module 10. As will be described, theprocessing module 10 is arranged for determining a measure of overlapbetween the columns 4.n in the database(s) 2 in a highly efficientmanner. To that end, the processing module includes a retrieval unit 12arranged for retrieving, or receiving, columns 4.n of data entries fromthe database 2. In this example, the processing module 10 furtherincludes a hashing unit 14 arranged for creating for each column 4.n ahash list comprising for each data entry in the column a hash valuerepresentative of said data entry. In this example, the processingmodule 10 further includes a sorting unit 16 arranged for sorting thedata in the lists. In this example the sorting unit 16 is furtherarranged for discarding identical values from the lists. The processingmodule 10 further includes a first memory 18 for storing the lists.

The processing module 10 further includes a matrix creation unit 20arranged for creating a matrix. The number of columns in the matrixcorresponds to the number N of columns to be processed. The number ofrows in the matrix corresponds to the number N of columns to beprocessed. Thus, the matrix is an N×N matrix, having cells C_(ij),wherein i represents the column number and j represents the row numberof the cell in the matrix. The processing module 10 further includes asecond memory 22 for storing the matrix.

The processing module 10 further includes a processing unit 24. Theprocessing unit 24 is arranged for assigning a set of N indexed readpointers. Each read pointer is assigned to point to a single associatedsorted list in the first memory 18. The processing unit 24 is furtherarranged for setting each read pointer to the first entry of theassociated list. In this example, the sorted hash lists are beingprocessed in ascending order, therefore for each list the first value isthe lowest value of that list. In this example, the processing unit 24is further arranged for determining the index number(s) of the readpointer(s) pointing to the lowest value in the first memory 18. Theprocessing unit 24 is arranged for incrementing the value of cellsC_(ij) in the matrix in the second memory 22 having indices i,j, whereini and j each correspond to any of the index numbers of the pointer(s)pointing to the lowest value. The processing module 10 further includesa read pointer incrementing unit 26 arranged for incrementing the readpointer(s) pointing to the lowest value to point to the next, higher,value(s).

In this example, the data analysis system 1 further includes apresentation unit 28, such as a screen or monitor. The presentation unit28 may be used to display results of the processing by the processingmodule 10 to a user of the system 1. In this example, the data analysissystem 1 further includes an input unit 30, such as a keyboard, mouse,touchscreen or the like, for inputting commands to the processing module10.

The data analysis system 1 as described thus far can be used accordingto the following method. Reference is made to FIG. 2 which is aschematic flow chart of a method in accordance with an embodiment of theinvention. In step 200 the retrieval unit 12 retrieves, or receives, theN columns from the one or more databases 2. FIG. 4a shows an example offour columns of data retrieved from a database 2. In step 202 thehashing unit 14 creates for each column a hash list comprising for eachdata entry in the column a hash value representative of said data entry.FIG. 4b shows an example of data in the columns of FIG. 4a having beenhashed to hash values. In step 204 the sorting unit 16 sorts the valuesin the hash list according to the hash values in the list. In thisexample, the sorting unit 16 in step 204 for each list also discardsidentical values, so that each value is included in the list only once.FIG. 4c shows an example of the lists of hash values of FIG. 4b havingbeen sorted and duplicate hash values having been removed.

It will be appreciated that in this example the processing module 10retrieves, or receives, columns of data entries from the database(s) andprocesses these columns into sorted hash lists. It will be appreciatedthat it is also possible that the processing module 10 retrieves, orreceives, pre-processed sorted hash lists. In that case the steps 202and 204 are omitted.

In step 206 the matrix creation unit 20 creates the N×N matrix andstores the matrix in the second memory 22. FIG. 5 shows on the left handside the four sorted hash lists of FIG. 4c and on the right hand sidethe created 4×4 matrix. The matrix has cells C_(ij), wherein irepresents the column number and j represents the row number of the cellin the matrix. The column and row numbers are indicated in FIG. 5a . Thematrix is empty, that is all values are set to zero, in the example ofFIG. 5 a.

In step 208 the processing unit 24 assigns N read pointer. Each readpointer points to a single associated hash list in the first memory 18.In step 210 each read pointer is set to point to the first entry of theassociated hash list. In FIG. 5b the entry in the hash list to which therespective read pointer points is indicated by a black background. Itwill be appreciated that in FIG. 5b all read pointers point to the firstentries of all respective hash lists.

In step 212 the processing unit 24 determines the index number(s) of theread pointer(s) pointing to the lowest hash value. In the example ofFIG. 5b the read pointers pointing to the lists numbered 1, 2 and 4point to the value “A”, whereas the read pointer pointing the listnumbered 3 points to the value “C”. Therefore, the processing unit 24determines that read pointers with index numbers 1, 2 and 4 point to thelowest hash value. Next, in step 214 the processing unit 24 incrementsthe value of cells C_(ij) in the matrix, wherein i and j each correspondto any of the determined index numbers 1, 2 and 4. In FIG. 5b theprocessing unit 24 thus increments the cells C₁₁, C₁₂, C₁₄, C₂₁, C₂₂,C₂₄, C₄₁, C₄₂, and C₄₄. In this example, the cell values are incrementedby one.

In step 216 the processing unit 24 determines whether or not all hashvalues in all lists have been processed yet. Since in the state shown inFIG. 5b not all hash values have been processed yet, in step 218 theread pointer incrementing unit 26 increments the read pointers havingthe just determined index numbers to point to the next different hashvalue(s). This is shown in FIG. 5c . The read pointers 1, 2 and 4 thatpointed to the value “A” in FIG. 5b are incremented to point to the nextentry in the respective hash lists.

Then the process is repeated. In step 212 the processing unit 24determines the index number(s) of the read pointer(s) pointing to thelowest hash value. In the example of FIG. 5c the read pointers pointingto the lists numbered 1 and 4 point to the value “B”, whereas the readpointer pointing the lists numbered 2 and 3 points to the value “C”.Therefore, the processing unit 24 determines that read pointers withindex numbers 1 and 4 point to the lowest hash value. Next, in step 214the processing unit 24 increments the value of cells C_(ij) in thematrix, wherein i and j each correspond to any of the determined indexnumbers 1 and 4. In FIG. 5c the processing unit 24 thus increments thecells C₁₁, C₁₄, C₄₁, and C₄₄.

This process is repeated throughout FIGS. 5d -5 j. In FIG. 5i the readpointers all point to the last entries the associated hash lists. Theread pointers with index 1, 2 and 3 point to the lowest value “H”. Instep 218 now

These read pointers are incremented to point to outside the respectivehash lists. The index numbers of these read pointers are ignored whenincrementing cells in the matrix in FIG. 5j . Instead of incrementingthese read pointers to point outside the respective hash lists, it isalso possible to refrain from incrementing these read pointers andignoring the index numbers of these read pointers when incrementingcells in the matrix in FIG. 5j . In FIG. 5j the last read pointers(index number 4) points to the last entry “I” of the associated hashlist. The resulting matrix is also shown in FIG. 5j . The resultingmatrix can be presented to a user of the system, e.g. via thepresentation unit 28.

It will be appreciated that the matrix is generated in a highlyefficient manner by processing and comparing all columns in parallel.This greatly reduces the time in which the matrix is generated, which isof importance when assessing large databases. In the example of FIGS. 4and 5 the database contains four columns of at most thirteen dataentries. It will be appreciated that these extremely low numbers arejust for demonstrating the underlying principle in a clear and concisemanner. In more practical applications the database can contain tens ofthousands or more columns and millions or billions or more separate dataentries.

The resulting matrix can also be used for further analysis. The valuesC_(ij), with i=j, on the diagonal represent the number of unique valueson each hash list. For example, in FIG. 5j C₁₁ has the value “8”corresponding to the number of unique values in the first hash list.Thus also the number of unique values in the first column is eight.

The off-diagonal values, i.e. C_(ij) with i#j, signify the number ofentries that the columns i and rows j have in common. Therefore, theoff-diagonal cell with the highest value signifies the combination ofcolumns i and j having the largest number of data entries in common. InFIG. 5j cells C₁₂ and C₂₁ have the value “5”, indicating that columns 1and 2 have five entries in common. In FIG. 5j cells C₃₄ and C₄₃ have thevalue “0”, indicating that columns 3 and 4 have no entries in common.

The processing unit 24 may further be arranged for normalizing thevalues in the cells of the matrix by dividing the value of each cellC_(ij) by the value of C_(ii). FIG. 5k shows the matrix of FIG. 5j thathas been normalized in this way. The normalized cells C_(ij) with i>jsignify the percentage of overlap of values in column i found in columnj. The normalized cells C_(ij) with i<j signify the percentage ofoverlap of values in column j found in column i. For example, the valueof C₂₁ is “1”, indicating that 100% of the entries of column 2 is alsoincluded in column 1. The value of C₂₁ on the other hand is “0.625”indicating that 62.5% of the entries of column 1 is also included incolumn 2. Thus, clearly column 2 is a subset of column 1. It will beappreciated that the matrix containing the normalized values in thecells is not necessarily symmetrical relative to the diagonal.

The processing unit 24 may further be arranged for processing the cellvalues as shown in FIG. 5j by dividing the value of cells C_(xy) by thevalue of cells C_(yx) (division by zero may need to be excluded). FIG.5l shows the matrix of FIG. 5j that has been processed in this way. Theprocessed cells C_(ij) signify the ratio of the amount of values presentin column i relative to column j. For example, the value of C₃₂ is“1.25”, indicating that column 2 includes 25% more data entries thancolumn 3. The value of C₂₃ on the other hand is “0.8” indicating theamount of data entries in column 3 is 80% of the amount of data entriesin column 2. The cell C_(ij) or C_(j), having the largest normalizedvalue and the processed value closest to one indicates the column jbeing the closest subset or superset of column i.

Results of such further analysis of the matrix as described above can bepresented to a user of the system, e.g. via the presentation unit 28.

If a matrix has been determined for a set of N columns it is possible toadd one or more columns to the set of columns and expanding the matrixto also include cell values for these added columns. Then, the retrievalunit 12 retrieves, or receives the further columns. For example a numberM columns can be added to the original N columns. The hashing unit 14and sorting unit 16 create the sorted hash lists for the additional Mcolumns. The matrix creation unit 20 adds N+1^(th) to N+M^(th) columnsand N+1^(th) to N+M^(th) rows to the matrix. Hence, an (N+M)×(N+M)matrix is obtained for the N+M columns. The processing unit 24 assigns aset of M additional indexed read pointers in addition to the original Nread pointers. Each read pointer points to a single associated sortedhash list, the N+1^(th) to N+M^(th) read pointers pointing to thefurther hash lists.

In step 212 the processing unit 24 determines the index number(s) of theread pointer(s) pointing to the lowest hash value. In step 214 the valueof cells C_(ij) in the matrix having indices i,j, wherein i and j eachcorrespond to any of the index numbers of the read pointers pointing tothe lowest value are incremented, but only for the cells for which atleast one of i and j is in the range of N+1 to N+M. The read pointer(s)pointing to the lowest hash value are incremented. This process isrepeated until the last read pointers points to the last entry of theassociated hash list. Thus, the original N×N matrix has been expanded tothe (N+M)×(N+M) matrix.

The system 1 and method described thus far can also be used fordetermining a type of data entries in one or more to-be-assessed columnsin a database. Reference is made to FIG. 3. Thereto besides retrieving,or receiving, the to-be-assessed columns in step 200A also one or morecolumns containing data entries of known types are retrieved, orreceived, in step 200B thus forming a number N of columns. These Ncolumns are processed as described above. Thus, optionally for theto-be-assessed columns a sorted hash list is created in steps 202A and204A, and for the columns of known types in steps 202B and 204B. Thematrix is created and filled in steps 206, 208, 210, 212, 214, 216 and218. Next, it is determined in step 220, e.g. by the processing unit 24,which cell C_(pq) and/or C_(qp) of the matrix indicates closestconformity between columns p and q, wherein the index p corresponds tothe to-be-assessed column or columns. The type of the data entries inthe to-be-assessed column is then determined to be similar to the knowntype of the data entries in the column corresponding to the other indexq. It will be appreciated that in this example the processing module 10retrieves, or receives, columns of data entries from the database(s)(steps 200A and 200B) and processes these columns into sorted hash lists(steps 202A, 204A, 202B and 204B). It will be appreciated that it isalso possible that the processing module 10 retrieves, or receives,pre-processed sorted hash lists. For instance, the columns of dataentries of known types may be retrieved, or received as sorted hashlists. Also the to-be-assessed columns may be retrieved, or received, assorted hash lists. The hash lists of the known types may e.g. be(permanently) stored in the first memory 18.

Determining which cell C_(pq) and/or C_(qp) indicates closest conformityfor example is done by determining which cell C_(pq) and/or C_(qp) hasthe highest value. The highest value indicates the list q having thelargest number of data entries in common with column p. A large numberof data entries of a known type corresponding to data entries of anunknown type may indicate a high chance, or correlation, that theunknown type is similar or identical to this known type.

Alternatively, or additionally, the values in the cells in column p ofthe matrix are normalized by dividing the value of each cell C_(pj) bythe value of C_(pp). Determining which cell C_(pq) indicates closestconformity then for example is done by determining which cell C_(pq) hasthe highest normalized value. The highest normalized value indicates thelist q having the largest percentage of data entries in common withcolumn p. A large percentage of data entries from a list of a known typecorresponding to data entries of an unknown type may indicate a highchance, or correlation, that the unknown type is similar or identical tothis known type.

Alternatively, or additionally, the values of the cells in row p and/orcolumn p are processed by dividing the value of cells C_(xy) by thevalue of cells C_(yx), The value of the thus processed value of C_(ij)signifies the ratio of the amount of values present in column i relativeto the amount of values present in column j. The cell C_(pq) or C_(qp)having the largest normalized value and the processed value closest toone indicates the column q being the closest subset or superset ofcolumn p.

In the foregoing, the invention has been described with reference tospecific examples of embodiments of the invention. It will, however, beevident that various modifications and changes may be made therein,without departing from the essence of the invention. For the purpose ofclarity and a concise description features are described herein as partof the same or separate embodiments, however, alternative embodimentshaving combinations of all or some of the features described in theseseparate embodiments are also envisaged.

It will be appreciated that the retrieval unit, hashing unit, sortingunit, discarding unit, processing unit, matrix creation unit, and readpointer indexing unit can be embodied as dedicated electronic circuits,possibly including software code portions. The retrieval unit, hashingunit, sorting unit, discarding unit, processing unit, matrix creationunit, and read pointer indexing unit can also be embodied as softwarecode portions executed on, and e.g. stored in, a memory of, aprogrammable apparatus such as a computer.

In the example the first memory 18 and the second memory 22 are part ofthe processing module 10. It will be appreciated that it is alsopossible that the first and/or second memory is included in a separateunit associated with the processing module. It is also possible that thefirst and second memory are both parts of one and the same memory.

In the examples, the sorted lists are processed in an ascendingdirection. It will be appreciated that it is also possible to processthe sorted lists in a descending direction. Then, the processing unitstarts by determining the index number(s) of the read pointer(s)pointing to the highest value in the first memory. The processing unitthen increments the value of cells C_(ij) in the matrix in the secondmemory having indices i,j, wherein i and j each correspond to any of theindex numbers of the pointer(s) pointing to the highest value. The readpointer incrementing unit then increments the read pointer(s) pointingto the highest value to point to the next, lower, value(s).

In the examples, the values of the cells of the matrix are incrementedby one. This may be beneficial so that integer values can be used. Itwill be appreciated that the values can be incremented by other valuesas well.

In the example of FIGS. 5a-5j all values of the matrix are incrementedin step 214. It will be appreciated that the resulting matrix as shownin FIG. 5j is symmetrical with respect to the diagonal, that is,C_(xy)=C_(yx). Therefore, it is also possible that in step 214 only halfof the matrix is updated, for instance only the cells C_(ij) for whichor the cells C_(ij) for which Then still the normalized matrix as shownin FIG. 5k can de obtained, due to the known symmetry of the matrix asshown in FIG. 5 j.

Although the embodiments of the invention described with reference tothe drawings comprise computer apparatus and processes performed incomputer apparatus, the invention also extends to computer programs,particularly computer programs on or in a carrier, adapted for puttingthe invention into practice. The program may be in the form of source orobject code or in any other form suitable for use in the implementationof the processes according to the invention. The carrier may be anyentity or device capable of carrying the program.

For example, the carrier may comprise a storage medium, such as a ROM,for example a CD ROM or a semiconductor ROM, or a magnetic recordingmedium, for example a floppy disc or hard disk. Further, the carrier maybe a transmissible carrier such as an electrical or optical signal whichmay be conveyed via electrical or optical cable or by radio or othermeans, e.g. via the internet or cloud.

When a program is embodied in a signal which may be conveyed directly bya cable or other device or means, the carrier may be constituted by suchcable or other device or means. Alternatively, the carrier may be anintegrated circuit in which the program is embedded, the integratedcircuit being adapted for performing, or for use in the performance of,the relevant processes.

However, other modifications, variations, and alternatives are alsopossible. The specifications, drawings and examples are, accordingly, tobe regarded in an illustrative sense rather than in a restrictive sense.

For the purpose of clarity and a concise description features aredescribed herein as part of the same or separate embodiments, however,it will be appreciated that the scope of the invention may includeembodiments having combinations of all or some of the featuresdescribed.

In the claims, any reference signs placed between parentheses shall notbe construed as limiting the claim. The word ‘comprising’ does notexclude the presence of other features or steps than those listed in aclaim. Furthermore, the words ‘a’ and ‘an’ shall not be construed aslimited to ‘only one’, but instead are used to mean ‘at least one’, anddo not exclude a plurality. The mere fact that certain measures arerecited in mutually different claims does not indicate that acombination of these measures cannot be used to an advantage.

What is claimed is:
 1. A computer implemented method of determining ameasure of overlap between data entries in a number N of columns in oneor more databases, the method including the computer: a) creating foreach column a hash list comprising for each data entry in the column ahash value representative of said data entry, the values in the hashlist being sorted; b) creating an N×N matrix, having cells C_(ij) ,wherein i represents the column number and j represents the row numberof the cell in the matrix; c) assigning a set of N indexed readpointers, each read pointer pointing to a single associated sorted hashlist; d) setting each read pointer to the first entry of the associatedhash list; e) determining the index number(s) of the read pointer(s)pointing to the lowest hash value when processing the lists in ascendingorder, or determining the index number(s) of the read pointer(s)pointing to the highest hash value when processing the lists indescending order; f) incrementing the value of cells C_(ij) in thematrix having indices i,j, wherein i and j each correspond to any of theindex numbers determined in step e); g) incrementing the read pointer(s)having the index number(s) determined in step e) to point to the nextdifferent hash value(s) and repeating steps e) and f); and h) repeatingstep g) until the last read pointer(s) points to the last entry of theassociated hash list.
 2. The method of claim 1 wherein the step a)includes for each hash list discarding identical values.
 3. The methodof claim 1 wherein the step f) includes incrementing the value of thecells C_(ij) in the matrix having indices i,j, wherein i,j correspondsto each unique combination of the index numbers determined in step e),including the cases i=j.
 4. The method of claim 1 wherein the step f)includes incrementing the value of all cells C_(ij) in the matrix havingindices i,j, wherein i and j each corresponding to any of the indexnumbers determined in step e).
 5. The method of claim 1, wherein allhash values within a hash list have the same length.
 6. The method ofclaim 1, wherein all hash values in all hash lists have the same length.7. The method of claim 1, including the step of the computer determininga desired level of accuracy and creating the hash values having a lengthdetermined on the basis of the desired level of accuracy.
 8. The methodof claim 1, further including after step h) the computer determining thecell C_(ij), with i unequal to j, having the highest value.
 9. Themethod of claim 1, further including after step h) the computerdetermining the cell(s) C_(ij) having a value that is higher than athreshold value.
 10. The method of claim 1, further including after steph) the computer normalizing the values in the cells of the matrix bydividing the value of each cell C_(ij) by the value of C.
 11. The methodof claim 10, further including the computer determining the cell(s)C_(ij) having a normalized value that is higher than a threshold value.12. The method of claim 1, wherein the number N of columns is more than1000, more preferably more than 10000, even more preferably more than100000.
 13. The method of claim 1, further including the computerperforming the step of including in the number N of columns at least onecolumn containing data entries of a known type.
 14. The method of claim1, wherein step g) includes: when the read pointer was not alreadypointing to the last entry in the hash list: incrementing the readpointer(s) pointing to the hash value determined in step e) to point tothe next entry in the respective hash list, and when the read pointerwas already pointing to the last entry in the hash list: notincrementing the read pointer(s) pointing to the hash value determinedin step e), or incrementing the read pointer(s) pointing to the hashvalue determined in step e) to point to outside the respective hashlist, and ignoring the index number of that read pointer(s) forsubsequent steps f) and g).
 15. The method of claim 1, further includingafter step h) the computer: i) receiving a further column and creating afurther sorted hash list comprising for each data entry in the furthercolumn a hash value representative of said data entry; j) adding anN+1^(th) column and an N+1^(th) row to the matrix; k) assigning a set ofN+1 indexed read pointers, each read pointer pointing to a singleassociated sorted hash list, the N+1^(th) read pointer pointing to thefurther hash list; l) setting each read pointer to the first entry ofthe associated hash list; m) determining the index number(s) of the readpointer(s) pointing to the lowest hash value when processing the listsin ascending order, or determining the index number(s) of the readpointer(s) pointing to the highest hash value when processing the listsin descending order; n) incrementing the value of cells C_(ij) in thematrix having indices i,j, wherein i and j each corresponding to any ofthe index numbers determined in step m) and wherein at least one of iand j is equal to N+1; o) incrementing the read pointer(s) having theindex number(s) determined in step m) and repeating steps m) and n); andp) repeating step o) until the last read pointer(s) points to the lastentry of the associated hash list.
 16. The method of claim 1, furtherincluding after step h) the computer: i) receiving a further number M ofcolumns and creating M further sorted hash lists comprising for eachdata entry in the further column a hash value representative of saiddata entry; j) adding N+1 to N+M^(th) columns and N+1 to N+M^(th) rowsto the matrix; k) assigning a set of N+M indexed read pointers, eachread pointer pointing to a single associated sorted hash list, theN+1^(th) to N+M^(th) read pointers pointing to the further hash lists;l) setting each read pointer to the first entry of the associated hashlist; m) determining the index number(s) of the read pointer(s) pointingto the lowest hash value when processing the lists in ascending order,or determining the index number(s) of the read pointer(s) pointing tothe highest hash value when processing the lists in descending order; n)incrementing the value of cells C_(ij) in the matrix having indices i,j,wherein i and j each correspond to any of the index numbers determinedin step m) and wherein at least one of i and j is in the range of N+1 toN+M; o) incrementing the read pointer(s) having the index number(s)determined in step m) and repeating steps m) and n); and p) repeatingstep o) until the last read pointer(s) points to the last entry of theassociated hash list.
 17. The method of claim 1, wherein a type of dataentries in at least one of the columns is one of first name, last name,social security number, phone number, email address, IP-address, streetaddress, postal code, city, country, bank account number, credit cardnumber.
 18. A computer implemented method of determining a measure ofoverlap between data entries in a number N of columns in one or moredatabases, the method including the computer: a) retrieving for eachcolumn a list comprising for each data entry in the column a valuerepresentative of said data entry, the values in the list being sorted;b) creating an N×N matrix, having cells C_(ij), wherein i represents thecolumn number and j represents the row number of the cell in the matrix;c) assigning a set of N indexed read pointers, each read pointerpointing to a single associated sorted list; d) setting each readpointer to the first entry of the associated list; e) determining theindex number(s) of the read pointer(s) pointing to the lowest value whenprocessing the lists in ascending order, or determining the indexnumber(s) of the read pointer(s) pointing to the highest value whenprocessing the lists in descending order; f) incrementing the value ofcells C_(ij) in the matrix having indices i,j, wherein i and j eachcorrespond to any of the index numbers determined in step e); g)incrementing the read pointer(s) having the index number(s) determinedin step (e) to point to the next different value(s) and repeating stepse) and f); and h) repeating step g) until the last read pointer(s)points to the last entry of the associated list.
 19. A computerimplemented method of determining a type of data entries in ato-be-assessed column in a database, the method including the computer:a) creating for the to-be assessed column a to-be-assessed sorted hashlist comprising for each data entry in the column a hash valuerepresentative of said data entry; and providing one or more sorted hashlists containing hash values representative of data entries of knowntypes thus forming a number N of sorted hash lists; b) creating an N×Nmatrix, having cells C_(ij) , wherein i represents the column number andj represents the row number of the cell in the matrix; c) assigning aset of N indexed read pointers, each read pointer pointing to a singleassociated sorted hash list; d) setting each read pointer to the firstentry of the associated hash list; e) determining the index number(s) ofthe read pointer(s) pointing to the lowest hash value when processingthe lists in ascending order, or determining the index number(s) of theread pointer(s) pointing to the highest hash value when processing thelists in descending order; f) incrementing the value of cells C_(ij) inthe matrix having indices i,j, wherein i and j each correspond to any ofthe index numbers determined in step e); g) incrementing the readpointer(s) having the index number(s) determined in step e) to point tothe next different hash value(s) and repeating steps e) and f); and h)repeating step g) until the last read pointer(s) points to the lastentry of the associated hash list; determining which cell C_(pq) and/orC_(qp) indicates closest conformity between lists p and q, wherein theindex p corresponds to the to-be-assessed column; and determining thetype of the data entries in the to-be-assessed column to be similar tothe known type of the data entries in the hash list corresponding to theother index q.
 20. The method of claim 19, wherein determining whichcell C_(pq) and/or C_(qp) indicates closest conformity, includesdetermining which cell C_(pq) and/or C_(qp) has the highest value.